The Photovoltaic Revolution

With climate change likely to remain a prevalent concern among Americans for the foreseeable future, there has been considerable debate about how to approach the issue. Some consider the failure to act now a proverbial nail in the coffin, while others are optimistic that the pace of technological change will eventually resolve the problem. William Nordhaus, often described as the leading mind in the economics of climate change, frames the issue as a matter of careful cost-benefit analysis. In his book, The Climate Casino: Risk, Uncertainty, and Economics for a Warming World, he describes some of the nuanced and unresolved issues in the economics of climate action and even questions some of the methods widely considered a best course of action.

Photovoltaic panels, or solar panels, are one such method: although they have proven efficient in some scenarios, they are not a feasible first choice for clean electricity production on a massive scale. However, solar panels can supplement net electricity consumption, and many solar advocates claim that they can decrease the price of electricity, not just for your own home or place of business but for your neighbors as well. This has been the primary incentive for integrating solar panels into communities. We'd like to see whether we can predict the use of solar panels in a given region, possibly gaining insight into the characteristics associated with solar adoption. Although many incentives are in place to increase public support and adoption of solar panels, the adoption rate of residential, community-wide solar integration appears to have stagnated. We'll attempt to predict the adoption of solar panels across the U.S. using various machine learning algorithms.

Stanford's DeepSolar project used satellite imaging and computer vision to identify solar panels within the continental United States. Rather than undertaking a project to image the entire U.S., is there a way to predict the adoption of photovoltaic panels across the country using demographic, economic, and geospatial data? The data can be found on the DeepSolar website: http://web.stanford.edu/group/deepsolar/home.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow
In [54]:
from sklearn.metrics import roc_auc_score, roc_curve

def plot_ann_auc(model, X_test, y_test, model_name='model_name'):
    # No-skill baseline: a constant score for every sample (AUC = 0.5)
    n_probs = [0 for _ in range(len(y_test))]
    # Keras-style models return one sigmoid probability per sample; take column 0
    x_probs = model.predict_proba(X_test)
    x_probs = x_probs[:, 0]
    n_auc = roc_auc_score(y_test, n_probs)
    x_auc = roc_auc_score(y_test, x_probs)

    print('No Skill Rate AUC: ', n_auc)
    print('Learned AUC: ', x_auc)

    # ROC curves for the baseline and the learned model
    n_fpr, n_tpr, _ = roc_curve(y_test, n_probs)
    l_fpr, l_tpr, _ = roc_curve(y_test, x_probs)

    plt.figure(figsize=(16, 8))
    plt.plot(n_fpr, n_tpr, linestyle='--', label='No skill rate')
    plt.plot(l_fpr, l_tpr, marker='.', label=model_name)
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.legend()
    plt.show()
In [68]:
def plot_prob_auc(model, X_test, y_test, model_name='model_name'):
    # Same as plot_ann_auc, but for scikit-learn classifiers: predict_proba
    # returns shape (n_samples, 2), so column 1 is the positive class
    n_probs = [0 for _ in range(len(y_test))]
    x_probs = model.predict_proba(X_test)
    x_probs = x_probs[:, 1]
    n_auc = roc_auc_score(y_test, n_probs)
    x_auc = roc_auc_score(y_test, x_probs)

    print('No Skill Rate AUC: ', n_auc)
    print('Learned AUC: ', x_auc)

    n_fpr, n_tpr, _ = roc_curve(y_test, n_probs)
    l_fpr, l_tpr, _ = roc_curve(y_test, x_probs)

    plt.figure(figsize=(16, 8))
    plt.plot(n_fpr, n_tpr, linestyle='--', label='No skill rate')
    plt.plot(l_fpr, l_tpr, marker='.', label=model_name)
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.legend()
    plt.show()
In [2]:
data = pd.read_csv('deepsolar_tract.csv', delimiter=',',encoding='latin-1')
In [3]:
use_cols = ['tile_count','average_household_income','county','education_bachelor','education_college',
            'education_doctoral','education_high_school_graduate','education_less_than_high_school',
            'education_master','education_population','education_professional_school','land_area',
            'per_capita_income','population','population_density','poverty_family_below_poverty_level',
            'poverty_family_count','state','total_area','unemployed','water_area','employ_rate',
            'poverty_family_below_poverty_level_rate','median_household_income','electricity_price_residential',
            'electricity_price_commercial','electricity_price_industrial','electricity_price_transportation',
            'electricity_price_overall','electricity_consume_residential','electricity_consume_commercial',
            'electricity_consume_industrial','electricity_consume_total','household_count','average_household_size',
            'housing_unit_count','housing_unit_occupied_count','housing_unit_median_value',
            'housing_unit_median_gross_rent','heating_design_temperature',
            'cooling_design_temperature','earth_temperature_amplitude','frost_days','air_temperature',
            'relative_humidity','daily_solar_radiation','atmospheric_pressure','wind_speed','earth_temperature',
            'heating_degree_days','cooling_degree_days','age_18_24_rate','age_25_34_rate','age_more_than_85_rate',
            'age_75_84_rate','age_35_44_rate','age_45_54_rate','age_65_74_rate','age_55_64_rate','age_10_14_rate',
            'age_15_17_rate','age_5_9_rate','household_type_family_rate','dropout_16_19_inschool_rate',
            'occupation_construction_rate','occupation_public_rate','occupation_information_rate',
            'occupation_finance_rate','occupation_education_rate','occupation_administrative_rate',
            'occupation_manufacturing_rate','occupation_wholesale_rate','occupation_retail_rate',
            'occupation_transportation_rate','occupation_arts_rate','occupation_agriculture_rate',
            'occupancy_vacant_rate','occupancy_owner_rate','mortgage_with_rate','transportation_home_rate',
            'transportation_car_alone_rate','transportation_walk_rate','transportation_carpool_rate',
            'transportation_motorcycle_rate','transportation_bicycle_rate','transportation_public_rate',
            'travel_time_less_than_10_rate','travel_time_10_19_rate','travel_time_20_29_rate',
            'travel_time_30_39_rate','travel_time_40_59_rate','travel_time_60_89_rate','health_insurance_public_rate',
            'health_insurance_none_rate','age_median','travel_time_average','voting_2016_dem_percentage',
            'voting_2016_gop_percentage','voting_2012_dem_percentage','voting_2012_gop_percentage',
            'number_of_years_of_education','diversity','incentive_count_residential',
            'incentive_count_nonresidential','incentive_residential_state_level','incentive_nonresidential_state_level',
            'net_metering','feedin_tariff','cooperate_tax','property_tax','sales_tax','rebate','avg_electricity_retail_rate']

Dropped Features

The following features were dropped because they were redundant, because they directly leak the predicted variable, or because they are not necessary for domain understanding.

In [4]:
dropped = [col for col in data.columns if col not in use_cols]
np.array(dropped).flatten()
Out[4]:
array(['Unnamed: 0', 'solar_system_count', 'total_panel_area', 'fips',
       'employed', 'gini_index', 'heating_fuel_coal_coke',
       'heating_fuel_electricity', 'heating_fuel_fuel_oil_kerosene',
       'heating_fuel_gas', 'heating_fuel_housing_unit_count',
       'heating_fuel_none', 'heating_fuel_other', 'heating_fuel_solar',
       'race_asian', 'race_black_africa', 'race_indian_alaska',
       'race_islander', 'race_other', 'race_two_more', 'race_white',
       'education_less_than_high_school_rate',
       'education_high_school_graduate_rate', 'education_college_rate',
       'education_bachelor_rate', 'education_master_rate',
       'education_professional_school_rate', 'education_doctoral_rate',
       'race_white_rate', 'race_black_africa_rate',
       'race_indian_alaska_rate', 'race_asian_rate', 'race_islander_rate',
       'race_other_rate', 'race_two_more_rate', 'heating_fuel_gas_rate',
       'heating_fuel_electricity_rate',
       'heating_fuel_fuel_oil_kerosene_rate',
       'heating_fuel_coal_coke_rate', 'heating_fuel_solar_rate',
       'heating_fuel_other_rate', 'heating_fuel_none_rate',
       'solar_panel_area_divided_by_area', 'solar_panel_area_per_capita',
       'tile_count_residential', 'tile_count_nonresidential',
       'solar_system_count_residential',
       'solar_system_count_nonresidential',
       'total_panel_area_residential', 'total_panel_area_nonresidential',
       'lat', 'lon', 'elevation', 'voting_2016_dem_win',
       'voting_2012_dem_win', 'number_of_solar_system_per_household'],
      dtype='<U36')
In [5]:
df = data[use_cols].copy()  # .copy() avoids SettingWithCopyWarning on later assignments
In [6]:
df.tile_count.describe()
Out[6]:
count    72537.000000
mean        30.255787
std         86.337406
min          0.000000
25%          1.000000
50%          4.000000
75%         22.000000
max       4468.000000
Name: tile_count, dtype: float64
In [7]:
# Binary target: 1.0 if any solar tiles were detected in the tract, else 0.0
df['adoption'] = (df.tile_count > 0).astype(float)
In [8]:
df.adoption.value_counts()
Out[8]:
1.0    56258
0.0    16279
Name: adoption, dtype: int64
In [9]:
df.state = df.state.str.upper()
In [10]:
state_df = df.groupby('state').sum()
In [11]:
fig = px.choropleth(locations=state_df.index, locationmode="USA-states", color=state_df.tile_count, scope="usa")
fig.show()
In [12]:
corr = df.drop(['tile_count','county','state'],axis=1).corr()
plt.figure(figsize=(30,30))
fig = sns.heatmap(data=corr,cmap='binary')
In [56]:
from sklearn.preprocessing import MinMaxScaler,LabelEncoder
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, recall_score, f1_score, precision_score, roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split, ShuffleSplit
In [15]:
# Separate encoders so each mapping can be inverted later if needed
county_encoder = LabelEncoder()
state_encoder = LabelEncoder()
df['county_code'] = county_encoder.fit_transform(df.county)
df['state_code'] = state_encoder.fit_transform(df.state)
# The most frequent value in this column is a non-numeric placeholder;
# replace it with 0 so the column can be cast to float
df.loc[df.electricity_price_transportation==df.electricity_price_transportation.value_counts().index[0],'electricity_price_transportation'] = 0
df.electricity_price_transportation = df.electricity_price_transportation.astype(float)
In [16]:
[col for col in df.columns if df[col].dtype=='O']
Out[16]:
['county', 'state']
In [17]:
df.shape
Out[17]:
(72537, 116)
In [18]:
state_meds = df.groupby('state').median()
for col in state_meds.columns:
    print(col,' : ',round((state_meds[col].isna().sum()/72537)*100,2),'%')
tile_count  :  0.0 %
average_household_income  :  0.0 %
education_bachelor  :  0.0 %
education_college  :  0.0 %
education_doctoral  :  0.0 %
education_high_school_graduate  :  0.0 %
education_less_than_high_school  :  0.0 %
education_master  :  0.0 %
education_population  :  0.0 %
education_professional_school  :  0.0 %
land_area  :  0.0 %
per_capita_income  :  0.0 %
population  :  0.0 %
population_density  :  0.0 %
poverty_family_below_poverty_level  :  0.0 %
poverty_family_count  :  0.0 %
total_area  :  0.0 %
unemployed  :  0.0 %
water_area  :  0.0 %
employ_rate  :  0.0 %
poverty_family_below_poverty_level_rate  :  0.0 %
median_household_income  :  0.0 %
electricity_price_residential  :  0.0 %
electricity_price_commercial  :  0.0 %
electricity_price_industrial  :  0.0 %
electricity_price_transportation  :  0.0 %
electricity_price_overall  :  0.0 %
electricity_consume_residential  :  0.0 %
electricity_consume_commercial  :  0.0 %
electricity_consume_industrial  :  0.0 %
electricity_consume_total  :  0.0 %
household_count  :  0.0 %
average_household_size  :  0.0 %
housing_unit_count  :  0.0 %
housing_unit_occupied_count  :  0.0 %
housing_unit_median_value  :  0.0 %
housing_unit_median_gross_rent  :  0.0 %
heating_design_temperature  :  0.0 %
cooling_design_temperature  :  0.0 %
earth_temperature_amplitude  :  0.0 %
frost_days  :  0.0 %
air_temperature  :  0.0 %
relative_humidity  :  0.0 %
daily_solar_radiation  :  0.0 %
atmospheric_pressure  :  0.0 %
wind_speed  :  0.0 %
earth_temperature  :  0.0 %
heating_degree_days  :  0.0 %
cooling_degree_days  :  0.0 %
age_18_24_rate  :  0.0 %
age_25_34_rate  :  0.0 %
age_more_than_85_rate  :  0.0 %
age_75_84_rate  :  0.0 %
age_35_44_rate  :  0.0 %
age_45_54_rate  :  0.0 %
age_65_74_rate  :  0.0 %
age_55_64_rate  :  0.0 %
age_10_14_rate  :  0.0 %
age_15_17_rate  :  0.0 %
age_5_9_rate  :  0.0 %
household_type_family_rate  :  0.0 %
dropout_16_19_inschool_rate  :  0.0 %
occupation_construction_rate  :  0.0 %
occupation_public_rate  :  0.0 %
occupation_information_rate  :  0.0 %
occupation_finance_rate  :  0.0 %
occupation_education_rate  :  0.0 %
occupation_administrative_rate  :  0.0 %
occupation_manufacturing_rate  :  0.0 %
occupation_wholesale_rate  :  0.0 %
occupation_retail_rate  :  0.0 %
occupation_transportation_rate  :  0.0 %
occupation_arts_rate  :  0.0 %
occupation_agriculture_rate  :  0.0 %
occupancy_vacant_rate  :  0.0 %
occupancy_owner_rate  :  0.0 %
mortgage_with_rate  :  0.0 %
transportation_home_rate  :  0.0 %
transportation_car_alone_rate  :  0.0 %
transportation_walk_rate  :  0.0 %
transportation_carpool_rate  :  0.0 %
transportation_motorcycle_rate  :  0.0 %
transportation_bicycle_rate  :  0.0 %
transportation_public_rate  :  0.0 %
travel_time_less_than_10_rate  :  0.0 %
travel_time_10_19_rate  :  0.0 %
travel_time_20_29_rate  :  0.0 %
travel_time_30_39_rate  :  0.0 %
travel_time_40_59_rate  :  0.0 %
travel_time_60_89_rate  :  0.0 %
health_insurance_public_rate  :  0.0 %
health_insurance_none_rate  :  0.0 %
age_median  :  0.0 %
travel_time_average  :  0.0 %
voting_2016_dem_percentage  :  0.0 %
voting_2016_gop_percentage  :  0.0 %
voting_2012_dem_percentage  :  0.01 %
voting_2012_gop_percentage  :  0.01 %
number_of_years_of_education  :  0.0 %
diversity  :  0.0 %
incentive_count_residential  :  0.0 %
incentive_count_nonresidential  :  0.0 %
incentive_residential_state_level  :  0.0 %
incentive_nonresidential_state_level  :  0.0 %
net_metering  :  0.0 %
feedin_tariff  :  0.0 %
cooperate_tax  :  0.0 %
property_tax  :  0.0 %
sales_tax  :  0.0 %
rebate  :  0.0 %
avg_electricity_retail_rate  :  0.0 %
adoption  :  0.0 %
county_code  :  0.0 %
state_code  :  0.0 %
In [19]:
county_meds = df.groupby('county').median()
for col in county_meds.columns:
    print(col,' : ',round((county_meds[col].isna().sum()/72537)*100,2),'%')
tile_count  :  0.0 %
average_household_income  :  0.0 %
education_bachelor  :  0.0 %
education_college  :  0.0 %
education_doctoral  :  0.0 %
education_high_school_graduate  :  0.0 %
education_less_than_high_school  :  0.0 %
education_master  :  0.0 %
education_population  :  0.0 %
education_professional_school  :  0.0 %
land_area  :  0.0 %
per_capita_income  :  0.0 %
population  :  0.0 %
population_density  :  0.0 %
poverty_family_below_poverty_level  :  0.0 %
poverty_family_count  :  0.0 %
total_area  :  0.0 %
unemployed  :  0.0 %
water_area  :  0.0 %
employ_rate  :  0.0 %
poverty_family_below_poverty_level_rate  :  0.0 %
median_household_income  :  0.0 %
electricity_price_residential  :  0.0 %
electricity_price_commercial  :  0.0 %
electricity_price_industrial  :  0.0 %
electricity_price_transportation  :  0.0 %
electricity_price_overall  :  0.0 %
electricity_consume_residential  :  0.0 %
electricity_consume_commercial  :  0.0 %
electricity_consume_industrial  :  0.0 %
electricity_consume_total  :  0.0 %
household_count  :  0.0 %
average_household_size  :  0.0 %
housing_unit_count  :  0.0 %
housing_unit_occupied_count  :  0.0 %
housing_unit_median_value  :  0.0 %
housing_unit_median_gross_rent  :  0.0 %
heating_design_temperature  :  0.01 %
cooling_design_temperature  :  0.01 %
earth_temperature_amplitude  :  0.01 %
frost_days  :  0.01 %
air_temperature  :  0.01 %
relative_humidity  :  0.01 %
daily_solar_radiation  :  0.01 %
atmospheric_pressure  :  0.01 %
wind_speed  :  0.01 %
earth_temperature  :  0.01 %
heating_degree_days  :  0.01 %
cooling_degree_days  :  0.01 %
age_18_24_rate  :  0.0 %
age_25_34_rate  :  0.0 %
age_more_than_85_rate  :  0.0 %
age_75_84_rate  :  0.0 %
age_35_44_rate  :  0.0 %
age_45_54_rate  :  0.0 %
age_65_74_rate  :  0.0 %
age_55_64_rate  :  0.0 %
age_10_14_rate  :  0.0 %
age_15_17_rate  :  0.0 %
age_5_9_rate  :  0.0 %
household_type_family_rate  :  0.0 %
dropout_16_19_inschool_rate  :  0.0 %
occupation_construction_rate  :  0.0 %
occupation_public_rate  :  0.0 %
occupation_information_rate  :  0.0 %
occupation_finance_rate  :  0.0 %
occupation_education_rate  :  0.0 %
occupation_administrative_rate  :  0.0 %
occupation_manufacturing_rate  :  0.0 %
occupation_wholesale_rate  :  0.0 %
occupation_retail_rate  :  0.0 %
occupation_transportation_rate  :  0.0 %
occupation_arts_rate  :  0.0 %
occupation_agriculture_rate  :  0.0 %
occupancy_vacant_rate  :  0.0 %
occupancy_owner_rate  :  0.0 %
mortgage_with_rate  :  0.0 %
transportation_home_rate  :  0.0 %
transportation_car_alone_rate  :  0.0 %
transportation_walk_rate  :  0.0 %
transportation_carpool_rate  :  0.0 %
transportation_motorcycle_rate  :  0.0 %
transportation_bicycle_rate  :  0.0 %
transportation_public_rate  :  0.0 %
travel_time_less_than_10_rate  :  0.0 %
travel_time_10_19_rate  :  0.0 %
travel_time_20_29_rate  :  0.0 %
travel_time_30_39_rate  :  0.0 %
travel_time_40_59_rate  :  0.0 %
travel_time_60_89_rate  :  0.0 %
health_insurance_public_rate  :  0.0 %
health_insurance_none_rate  :  0.0 %
age_median  :  0.0 %
travel_time_average  :  0.0 %
voting_2016_dem_percentage  :  0.0 %
voting_2016_gop_percentage  :  0.0 %
voting_2012_dem_percentage  :  0.27 %
voting_2012_gop_percentage  :  0.27 %
number_of_years_of_education  :  0.0 %
diversity  :  0.0 %
incentive_count_residential  :  0.0 %
incentive_count_nonresidential  :  0.0 %
incentive_residential_state_level  :  0.0 %
incentive_nonresidential_state_level  :  0.0 %
net_metering  :  0.0 %
feedin_tariff  :  0.0 %
cooperate_tax  :  0.0 %
property_tax  :  0.0 %
sales_tax  :  0.0 %
rebate  :  0.0 %
avg_electricity_retail_rate  :  0.0 %
adoption  :  0.0 %
county_code  :  0.0 %
state_code  :  0.0 %
In [20]:
county_impute = ['average_household_income','land_area','per_capita_income','population_density',
                 'total_area','water_area','employ_rate','poverty_family_below_poverty_level_rate',
                 'median_household_income','average_household_size','housing_unit_median_value',
                 'housing_unit_median_gross_rent','age_18_24_rate','age_25_34_rate','age_more_than_85_rate',
                 'age_75_84_rate','age_35_44_rate','age_45_54_rate','age_65_74_rate','age_55_64_rate','age_10_14_rate',
                 'age_15_17_rate','age_5_9_rate','household_type_family_rate','dropout_16_19_inschool_rate',
                 'occupation_construction_rate','occupation_public_rate','occupation_information_rate','occupation_finance_rate',
                 'occupation_education_rate','occupation_administrative_rate','occupation_manufacturing_rate',
                 'occupation_wholesale_rate','occupation_retail_rate','occupation_transportation_rate','occupation_arts_rate',
                 'occupation_agriculture_rate','occupancy_vacant_rate','occupancy_owner_rate','mortgage_with_rate',
                 'transportation_home_rate','transportation_car_alone_rate','transportation_walk_rate','transportation_carpool_rate',
                 'transportation_motorcycle_rate','transportation_bicycle_rate','transportation_public_rate',
                 'travel_time_less_than_10_rate','travel_time_10_19_rate','travel_time_20_29_rate','travel_time_30_39_rate',
                 'travel_time_40_59_rate','travel_time_60_89_rate','health_insurance_public_rate','health_insurance_none_rate',
                 'age_median','travel_time_average','voting_2012_dem_percentage','voting_2012_gop_percentage',
                 'number_of_years_of_education','diversity']

state_impute = ['heating_design_temperature','cooling_design_temperature',
                'earth_temperature_amplitude','frost_days','air_temperature','relative_humidity',
                'daily_solar_radiation','atmospheric_pressure','wind_speed','earth_temperature',
                'heating_degree_days','cooling_degree_days','voting_2012_dem_percentage','voting_2012_gop_percentage']
In [21]:
for col in county_impute:
    df[col] = df.groupby('county')[col].transform(lambda x: x.fillna(x.median()))
    
for col in state_impute:
    df[col] = df.groupby('state')[col].transform(lambda x: x.fillna(x.median()))
In [23]:
df.loc[df.voting_2012_dem_percentage.isnull(),'voting_2012_dem_percentage'] = df.voting_2012_dem_percentage.median()
df.loc[df.voting_2012_gop_percentage.isnull(),'voting_2012_gop_percentage'] = df.voting_2012_gop_percentage.median()
In [24]:
for col in df.columns:
    print(col,' : ',round((df[col].isna().sum()/72537)*100,2),'%')
tile_count  :  0.0 %
average_household_income  :  0.0 %
county  :  0.0 %
education_bachelor  :  0.0 %
education_college  :  0.0 %
education_doctoral  :  0.0 %
education_high_school_graduate  :  0.0 %
education_less_than_high_school  :  0.0 %
education_master  :  0.0 %
education_population  :  0.0 %
education_professional_school  :  0.0 %
land_area  :  0.0 %
per_capita_income  :  0.0 %
population  :  0.0 %
population_density  :  0.0 %
poverty_family_below_poverty_level  :  0.0 %
poverty_family_count  :  0.0 %
state  :  0.0 %
total_area  :  0.0 %
unemployed  :  0.0 %
water_area  :  0.0 %
employ_rate  :  0.0 %
poverty_family_below_poverty_level_rate  :  0.0 %
median_household_income  :  0.0 %
electricity_price_residential  :  0.0 %
electricity_price_commercial  :  0.0 %
electricity_price_industrial  :  0.0 %
electricity_price_transportation  :  0.0 %
electricity_price_overall  :  0.0 %
electricity_consume_residential  :  0.0 %
electricity_consume_commercial  :  0.0 %
electricity_consume_industrial  :  0.0 %
electricity_consume_total  :  0.0 %
household_count  :  0.0 %
average_household_size  :  0.0 %
housing_unit_count  :  0.0 %
housing_unit_occupied_count  :  0.0 %
housing_unit_median_value  :  0.0 %
housing_unit_median_gross_rent  :  0.0 %
heating_design_temperature  :  0.0 %
cooling_design_temperature  :  0.0 %
earth_temperature_amplitude  :  0.0 %
frost_days  :  0.0 %
air_temperature  :  0.0 %
relative_humidity  :  0.0 %
daily_solar_radiation  :  0.0 %
atmospheric_pressure  :  0.0 %
wind_speed  :  0.0 %
earth_temperature  :  0.0 %
heating_degree_days  :  0.0 %
cooling_degree_days  :  0.0 %
age_18_24_rate  :  0.0 %
age_25_34_rate  :  0.0 %
age_more_than_85_rate  :  0.0 %
age_75_84_rate  :  0.0 %
age_35_44_rate  :  0.0 %
age_45_54_rate  :  0.0 %
age_65_74_rate  :  0.0 %
age_55_64_rate  :  0.0 %
age_10_14_rate  :  0.0 %
age_15_17_rate  :  0.0 %
age_5_9_rate  :  0.0 %
household_type_family_rate  :  0.0 %
dropout_16_19_inschool_rate  :  0.0 %
occupation_construction_rate  :  0.0 %
occupation_public_rate  :  0.0 %
occupation_information_rate  :  0.0 %
occupation_finance_rate  :  0.0 %
occupation_education_rate  :  0.0 %
occupation_administrative_rate  :  0.0 %
occupation_manufacturing_rate  :  0.0 %
occupation_wholesale_rate  :  0.0 %
occupation_retail_rate  :  0.0 %
occupation_transportation_rate  :  0.0 %
occupation_arts_rate  :  0.0 %
occupation_agriculture_rate  :  0.0 %
occupancy_vacant_rate  :  0.0 %
occupancy_owner_rate  :  0.0 %
mortgage_with_rate  :  0.0 %
transportation_home_rate  :  0.0 %
transportation_car_alone_rate  :  0.0 %
transportation_walk_rate  :  0.0 %
transportation_carpool_rate  :  0.0 %
transportation_motorcycle_rate  :  0.0 %
transportation_bicycle_rate  :  0.0 %
transportation_public_rate  :  0.0 %
travel_time_less_than_10_rate  :  0.0 %
travel_time_10_19_rate  :  0.0 %
travel_time_20_29_rate  :  0.0 %
travel_time_30_39_rate  :  0.0 %
travel_time_40_59_rate  :  0.0 %
travel_time_60_89_rate  :  0.0 %
health_insurance_public_rate  :  0.0 %
health_insurance_none_rate  :  0.0 %
age_median  :  0.0 %
travel_time_average  :  0.0 %
voting_2016_dem_percentage  :  0.0 %
voting_2016_gop_percentage  :  0.0 %
voting_2012_dem_percentage  :  0.0 %
voting_2012_gop_percentage  :  0.0 %
number_of_years_of_education  :  0.0 %
diversity  :  0.0 %
incentive_count_residential  :  0.0 %
incentive_count_nonresidential  :  0.0 %
incentive_residential_state_level  :  0.0 %
incentive_nonresidential_state_level  :  0.0 %
net_metering  :  0.0 %
feedin_tariff  :  0.0 %
cooperate_tax  :  0.0 %
property_tax  :  0.0 %
sales_tax  :  0.0 %
rebate  :  0.0 %
avg_electricity_retail_rate  :  0.0 %
adoption  :  0.0 %
county_code  :  0.0 %
state_code  :  0.0 %
In [25]:
df = df.dropna()
In [26]:
X = df.drop(['tile_count','adoption','county','state'],axis=1)
y = df['adoption']
X = X.astype(float)
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42,test_size=0.3)
In [28]:
scaler = MinMaxScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(X),columns=X.columns)
In [29]:
scaled_data['adoption'] = df['adoption']
In [30]:
scaled_data = scaled_data.dropna()
In [31]:
X_scaled = scaled_data.drop('adoption',axis=1)
y_scaled = scaled_data['adoption']
X_trains, X_tests, y_trains, y_tests = train_test_split(X_scaled, y_scaled, random_state=42, test_size=0.3)

Random Forest

Although gradient boosting would likely perform better, that algorithm is extremely computationally expensive, so for the sake of this analysis we'll use a Random Forest. A great property of Random Forests is their ability to arrive at interpretable conclusions. By viewing which features yield the highest information gain across the estimators, we can see which attributes contribute the most information across the forest of decision trees. Intuitively, we can treat feature importance as a proxy for the real-life predictive power of each attribute associated with the classification of solar adoption.
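As a sketch of how that interpretation works with scikit-learn's `feature_importances_` attribute (the toy data below is illustrative, not the DeepSolar features):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the tract-level feature matrix
X_toy, y_toy = make_classification(n_samples=500, n_features=8,
                                   n_informative=3, random_state=42)
X_toy = pd.DataFrame(X_toy, columns=[f'feature_{i}' for i in range(8)])

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_toy, y_toy)

# Mean impurity decrease per feature, averaged over trees; sums to 1
importances = pd.Series(forest.feature_importances_,
                        index=X_toy.columns).sort_values(ascending=False)
print(importances.head())
```

Ranking the fitted model's importances this way is how we would read off which DeepSolar attributes the forest leans on most.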

In [32]:
from sklearn.ensemble import RandomForestClassifier
In [186]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train,y_train)
Out[186]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
In [187]:
rf_pred = rf.predict(X_test)
print(classification_report(y_test,rf_pred))
              precision    recall  f1-score   support

         0.0       0.70      0.45      0.55      4892
         1.0       0.86      0.94      0.90     16867

    accuracy                           0.83     21759
   macro avg       0.78      0.70      0.72     21759
weighted avg       0.82      0.83      0.82     21759

In [ ]:
rf_auc = GridSearchCV(rf,param_grid={'n_estimators':np.arange(80,240,20),
                                     'criterion':['gini','entropy'],
                                     'max_depth':[5,10,25,50,100],
                                     'min_samples_split':[10,50,250,500]},
                      scoring='roc_auc',
                      cv=5,
                      n_jobs=-1)
rf_auc.fit(X_train,y_train)

Parameter Tuning

I've elected to tune the model with the cross-validated grid search that yields the highest ROC area under the curve (AUC). Since more of the United States has adopted solar than not, either the F1-score or the AUC is the better choice, avoiding the class-imbalance-driven misguidance of relying on accuracy as a measure of model performance. Roughly 77% of census tracts (FIPS-identified regions) have adopted PV as part of their energy infrastructure at some scale. We'd like to avoid developing a model that devolves into a simple majority-vote classifier, so accuracy is not an ideal indicator for our use case.
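To illustrate why accuracy misleads here (a toy sketch mirroring the roughly 77/23 class split, not the project data): a classifier that always predicts the majority class scores high accuracy yet has no ranking skill at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Imbalanced toy labels: ~77% positive, mirroring the adoption split
rng = np.random.default_rng(42)
y_true = (rng.random(10_000) < 0.77).astype(int)

# A "majority vote" classifier: always predict the positive class
y_pred = np.ones_like(y_true)
y_score = np.full(len(y_true), 0.5)  # constant score, no ranking ability

print('accuracy:', accuracy_score(y_true, y_pred))  # ~0.77, looks decent
print('ROC AUC :', roc_auc_score(y_true, y_score))  # 0.5, no skill
```

Accuracy rewards the degenerate majority vote, while AUC correctly pins it at the no-skill value of 0.5.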

In [33]:
rf_clf = RandomForestClassifier(n_estimators=500,
                                max_depth=25,
                                criterion='entropy',
                                min_samples_split=10,
                                random_state=42,
                                n_jobs=-1)
rf_clf.fit(X_train,y_train)
Out[33]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=25, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=10,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=-1, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

Random Forest Performance Evaluation

The tuned Random Forest performs well in predicting the adoption of solar panel generation across the U.S. Depending on the intended use of this data, we may want to optimize for different metrics. For instance, this model performs better on the majority class (counties with solar) than on the minority class. Considering the metrics below, rather than tuning further, we may want to resort to resampling methods or dimensionality reduction in the future.
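As one possible direction for that future work, here is a minimal random-oversampling sketch. The arrays are toy stand-ins with the same 4:1 flavor of imbalance, not the notebook's actual X_train/y_train; the idea is simply to replicate minority rows until the classes are balanced before refitting.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins for an imbalanced training set (8 majority, 2 minority).
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])

# Randomly resample minority-class rows (with replacement) until balanced.
minority = np.flatnonzero(y == 0)
n_extra = int((y == 1).sum() - minority.size)
extra = rng.choice(minority, size=n_extra, replace=True)

X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
print((y_bal == 0).sum(), (y_bal == 1).sum())  # 8 8
```

In practice the imbalanced-learn package (RandomOverSampler, SMOTE) handles this step, including keeping the resampling out of the test fold.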

In [34]:
rfclf_pred = rf_clf.predict(X_test)
print(classification_report(y_test,rfclf_pred))
              precision    recall  f1-score   support

         0.0       0.71      0.44      0.54      4892
         1.0       0.85      0.95      0.90     16867

    accuracy                           0.83     21759
   macro avg       0.78      0.69      0.72     21759
weighted avg       0.82      0.83      0.82     21759

In [69]:
plot_prob_auc(rf_clf,X_test,y_test,model_name='Random Forest')
No Skill Rate AUC:  0.5
Learned AUC:  0.8746644604139518

Artificial Neural Network

Below is the construction of a perceptron using the sigmoid function at the output layer and ReLU at the input layer, trained with Nadam: the Adam gradient optimization algorithm combined with Nesterov momentum. We can see the accuracy and AUC after each epoch. We also use an early stopping criterion, and binary cross-entropy as the loss function for the backpropagation weight and bias updates.

In [197]:
from tensorflow.keras import models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras import initializers, optimizers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.metrics import AUC
In [233]:
perc = Sequential()
perc.add(Dense(112,
               input_dim=112,
               activation='relu',
               kernel_initializer=initializers.glorot_uniform(seed=42),
               bias_initializer='zeros'))
perc.add(Dense(1,
               input_dim=112,
               activation='sigmoid',
               kernel_initializer=initializers.glorot_uniform(seed=42),
               bias_initializer='zeros'))
In [234]:
earlystop_callback = EarlyStopping(monitor='accuracy',
                                   min_delta=0.0001,
                                   patience=7)
In [235]:
auc = AUC()
In [236]:
perc.compile(optimizer='nadam',loss='binary_crossentropy',metrics=['accuracy',auc])
perc.fit(X_trains.values,y_trains.values,epochs=100,callbacks=[earlystop_callback])
Train on 50766 samples
Epoch 1/100
50766/50766 [==============================] - 2s 40us/sample - loss: 0.4367 - accuracy: 0.7918 - auc_9: 0.7869
Epoch 2/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.4186 - accuracy: 0.7994 - auc_9: 0.8101
Epoch 3/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.4126 - accuracy: 0.8033 - auc_9: 0.8166
Epoch 4/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.4087 - accuracy: 0.8045 - auc_9: 0.8209
Epoch 5/100
50766/50766 [==============================] - 2s 31us/sample - loss: 0.4051 - accuracy: 0.8067 - auc_9: 0.8249
Epoch 6/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.4028 - accuracy: 0.8072 - auc_9: 0.8274
Epoch 7/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3997 - accuracy: 0.8095 - auc_9: 0.8306
Epoch 8/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3978 - accuracy: 0.8100 - auc_9: 0.8328
Epoch 9/100
50766/50766 [==============================] - 2s 31us/sample - loss: 0.3960 - accuracy: 0.8116 - auc_9: 0.8349
Epoch 10/100
50766/50766 [==============================] - 2s 31us/sample - loss: 0.3950 - accuracy: 0.8104 - auc_9: 0.8357
Epoch 11/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3933 - accuracy: 0.8110 - auc_9: 0.8379
Epoch 12/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3915 - accuracy: 0.8107 - auc_9: 0.8395
Epoch 13/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3904 - accuracy: 0.8123 - auc_9: 0.8405
Epoch 14/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3894 - accuracy: 0.8124 - auc_9: 0.8413
Epoch 15/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3881 - accuracy: 0.8139 - auc_9: 0.8429
Epoch 16/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3877 - accuracy: 0.8135 - auc_9: 0.8433
Epoch 17/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3866 - accuracy: 0.8146 - auc_9: 0.8444
Epoch 18/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3853 - accuracy: 0.8163 - auc_9: 0.8455
Epoch 19/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3847 - accuracy: 0.8162 - auc_9: 0.8457
Epoch 20/100
50766/50766 [==============================] - 1s 30us/sample - loss: 0.3837 - accuracy: 0.8160 - auc_9: 0.8472
Epoch 21/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3836 - accuracy: 0.8161 - auc_9: 0.8470
Epoch 22/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3821 - accuracy: 0.8160 - auc_9: 0.8485
Epoch 23/100
50766/50766 [==============================] - 1s 30us/sample - loss: 0.3818 - accuracy: 0.8176 - auc_9: 0.8487
Epoch 24/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3810 - accuracy: 0.8169 - auc_9: 0.8493
Epoch 25/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3805 - accuracy: 0.8182 - auc_9: 0.8499
Epoch 26/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3796 - accuracy: 0.8181 - auc_9: 0.8509
Epoch 27/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3787 - accuracy: 0.8180 - auc_9: 0.8517
Epoch 28/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3781 - accuracy: 0.8189 - auc_9: 0.8522
Epoch 29/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3783 - accuracy: 0.8188 - auc_9: 0.8521
Epoch 30/100
50766/50766 [==============================] - 2s 31us/sample - loss: 0.3770 - accuracy: 0.8199 - auc_9: 0.8531
Epoch 31/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3764 - accuracy: 0.8210 - auc_9: 0.8539
Epoch 32/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3768 - accuracy: 0.8197 - auc_9: 0.8535
Epoch 33/100
50766/50766 [==============================] - 1s 30us/sample - loss: 0.3754 - accuracy: 0.8202 - auc_9: 0.8547
Epoch 34/100
50766/50766 [==============================] - 1s 30us/sample - loss: 0.3743 - accuracy: 0.8208 - auc_9: 0.8557
Epoch 35/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3744 - accuracy: 0.8207 - auc_9: 0.8557
Epoch 36/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3741 - accuracy: 0.8218 - auc_9: 0.8557
Epoch 37/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3729 - accuracy: 0.8227 - auc_9: 0.8566
Epoch 38/100
50766/50766 [==============================] - 1s 30us/sample - loss: 0.3723 - accuracy: 0.8203 - auc_9: 0.8575
Epoch 39/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3717 - accuracy: 0.8220 - auc_9: 0.8577
Epoch 40/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3718 - accuracy: 0.8216 - auc_9: 0.8578
Epoch 41/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3712 - accuracy: 0.8220 - auc_9: 0.8582
Epoch 42/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3701 - accuracy: 0.8229 - auc_9: 0.8596
Epoch 43/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3701 - accuracy: 0.8231 - auc_9: 0.8593
Epoch 44/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3698 - accuracy: 0.8227 - auc_9: 0.8599
Epoch 45/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3689 - accuracy: 0.8242 - auc_9: 0.8604
Epoch 46/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3687 - accuracy: 0.8226 - auc_9: 0.8607
Epoch 47/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3683 - accuracy: 0.8231 - auc_9: 0.8608
Epoch 48/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3673 - accuracy: 0.8244 - auc_9: 0.8621
Epoch 49/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3664 - accuracy: 0.8249 - auc_9: 0.8627
Epoch 50/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3668 - accuracy: 0.8246 - auc_9: 0.8625
Epoch 51/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3658 - accuracy: 0.8253 - auc_9: 0.8631
Epoch 52/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3658 - accuracy: 0.8245 - auc_9: 0.8635
Epoch 53/100
50766/50766 [==============================] - 1s 30us/sample - loss: 0.3646 - accuracy: 0.8255 - auc_9: 0.8643
Epoch 54/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3645 - accuracy: 0.8257 - auc_9: 0.8645
Epoch 55/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3638 - accuracy: 0.8264 - auc_9: 0.8650
Epoch 56/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3641 - accuracy: 0.8251 - auc_9: 0.8649
Epoch 57/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3634 - accuracy: 0.8262 - auc_9: 0.8654
Epoch 58/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3619 - accuracy: 0.8257 - auc_9: 0.8668
Epoch 59/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3622 - accuracy: 0.8273 - auc_9: 0.8663
Epoch 60/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3617 - accuracy: 0.8279 - auc_9: 0.8671
Epoch 61/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3615 - accuracy: 0.8271 - auc_9: 0.8671
Epoch 62/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3606 - accuracy: 0.8279 - auc_9: 0.8678
Epoch 63/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3609 - accuracy: 0.8276 - auc_9: 0.8677
Epoch 64/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3597 - accuracy: 0.8276 - auc_9: 0.8689
Epoch 65/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3596 - accuracy: 0.8282 - auc_9: 0.8687
Epoch 66/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3587 - accuracy: 0.8279 - auc_9: 0.8695
Epoch 67/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3585 - accuracy: 0.8289 - auc_9: 0.8698
Epoch 68/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3583 - accuracy: 0.8289 - auc_9: 0.8697
Epoch 69/100
50766/50766 [==============================] - 2s 31us/sample - loss: 0.3577 - accuracy: 0.8298 - auc_9: 0.8701
Epoch 70/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3571 - accuracy: 0.8296 - auc_9: 0.8710
Epoch 71/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3571 - accuracy: 0.8296 - auc_9: 0.8708
Epoch 72/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3564 - accuracy: 0.8316 - auc_9: 0.8715
Epoch 73/100
50766/50766 [==============================] - 1s 30us/sample - loss: 0.3566 - accuracy: 0.8280 - auc_9: 0.8712
Epoch 74/100
50766/50766 [==============================] - 2s 30us/sample - loss: 0.3552 - accuracy: 0.8313 - auc_9: 0.8726
Epoch 75/100
50766/50766 [==============================] - 2s 31us/sample - loss: 0.3553 - accuracy: 0.8307 - auc_9: 0.8724
Epoch 76/100
50766/50766 [==============================] - 1s 30us/sample - loss: 0.3552 - accuracy: 0.8307 - auc_9: 0.8722
Epoch 77/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3544 - accuracy: 0.8302 - auc_9: 0.8734
Epoch 78/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3538 - accuracy: 0.8314 - auc_9: 0.8735
Epoch 79/100
50766/50766 [==============================] - 1s 29us/sample - loss: 0.3538 - accuracy: 0.8311 - auc_9: 0.8735
Out[236]:
<tensorflow.python.keras.callbacks.History at 0x1f721327c88>
In [237]:
plot_ann_auc(perc,X_tests.values,y_tests.values,model_name='Perceptron')
No Skill Rate AUC:  0.5
Learned AUC:  0.828785782575694
In [238]:
perc_pred = perc.predict_classes(X_tests.values).flatten()
print(classification_report(y_tests,perc_pred))
              precision    recall  f1-score   support

         0.0       0.56      0.48      0.52      4889
         1.0       0.86      0.89      0.87     16868

    accuracy                           0.80     21757
   macro avg       0.71      0.68      0.69     21757
weighted avg       0.79      0.80      0.79     21757

Additional Layers and Dropout Regularization

The perceptron network appears to be overfitting to the training dataset, returning an AUC and accuracy higher than those on our test data. Though performance on the minority class is better on the test data, the test AUC is still lower than the train AUC and lower than the Random Forest's. To combat this overfitting, we can add more layers, giving the network more weights and biases to update, and include dropout layers to help the model break up co-adapted nodes.

In [241]:
ann = Sequential()
ann.add(Dense(112,
              input_dim=112,
              activation='relu',
              kernel_initializer=initializers.glorot_uniform(seed=42),
              bias_initializer='zeros'))
ann.add(Dense(224,
              input_dim=112,
              activation='relu',
              kernel_initializer=initializers.glorot_uniform(seed=42),
              bias_initializer='zeros'))
ann.add(Dropout(0.5,seed=42))
ann.add(Dense(112,
              input_dim=112,
              activation='relu',
              kernel_initializer=initializers.glorot_uniform(seed=42),
              bias_initializer='zeros'))
ann.add(Dense(224,
              input_dim=112,
              activation='relu',
              kernel_initializer=initializers.glorot_uniform(seed=42),
              bias_initializer='zeros'))
ann.add(Dropout(0.5,seed=42))
ann.add(Dense(112,
              input_dim=112,
              activation='relu',
              kernel_initializer=initializers.glorot_uniform(seed=42),
              bias_initializer='zeros'))
ann.add(Dense(1,
              input_dim=112,
              activation='sigmoid',
              kernel_initializer=initializers.glorot_uniform(seed=42),
              bias_initializer='zeros'))
In [242]:
ann.compile(optimizer='nadam',loss='binary_crossentropy',metrics=['accuracy',auc])
ann.fit(X_trains.values,y_trains.values,epochs=100,callbacks=[earlystop_callback])
Train on 50766 samples
Epoch 1/100
50766/50766 [==============================] - 5s 101us/sample - loss: 0.4410 - accuracy: 0.7893 - auc_9: 0.7813
Epoch 2/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.4180 - accuracy: 0.8009 - auc_9: 0.8108
Epoch 3/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.4115 - accuracy: 0.8044 - auc_9: 0.8189
Epoch 4/100
50766/50766 [==============================] - 4s 69us/sample - loss: 0.4072 - accuracy: 0.8057 - auc_9: 0.8232
Epoch 5/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.4050 - accuracy: 0.8050 - auc_9: 0.8258
Epoch 6/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.4010 - accuracy: 0.8074 - auc_9: 0.8297
Epoch 7/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.4002 - accuracy: 0.8059 - auc_9: 0.8307
Epoch 8/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3978 - accuracy: 0.8077 - auc_9: 0.8329
Epoch 9/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3973 - accuracy: 0.8089 - auc_9: 0.8332
Epoch 10/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3958 - accuracy: 0.8098 - auc_9: 0.8352
Epoch 11/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3945 - accuracy: 0.8092 - auc_9: 0.8366
Epoch 12/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3936 - accuracy: 0.8102 - auc_9: 0.8373
Epoch 13/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3912 - accuracy: 0.8110 - auc_9: 0.8400
Epoch 14/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3908 - accuracy: 0.8130 - auc_9: 0.8405
Epoch 15/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3898 - accuracy: 0.8123 - auc_9: 0.8412
Epoch 16/100
50766/50766 [==============================] - 4s 72us/sample - loss: 0.3890 - accuracy: 0.8116 - auc_9: 0.8419
Epoch 17/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3884 - accuracy: 0.8116 - auc_9: 0.8423
Epoch 18/100
50766/50766 [==============================] - 4s 72us/sample - loss: 0.3879 - accuracy: 0.8129 - auc_9: 0.8430
Epoch 19/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3863 - accuracy: 0.8130 - auc_9: 0.8446
Epoch 20/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3862 - accuracy: 0.8127 - auc_9: 0.8446
Epoch 21/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3849 - accuracy: 0.8132 - auc_9: 0.8460
Epoch 22/100
50766/50766 [==============================] - 4s 72us/sample - loss: 0.3844 - accuracy: 0.8135 - auc_9: 0.8464
Epoch 23/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3826 - accuracy: 0.8146 - auc_9: 0.8476
Epoch 24/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3820 - accuracy: 0.8142 - auc_9: 0.8481
Epoch 25/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3820 - accuracy: 0.8155 - auc_9: 0.8488
Epoch 26/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3816 - accuracy: 0.8150 - auc_9: 0.8487
Epoch 27/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3818 - accuracy: 0.8145 - auc_9: 0.8488
Epoch 28/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3808 - accuracy: 0.8154 - auc_9: 0.8493
Epoch 29/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3804 - accuracy: 0.8149 - auc_9: 0.8499
Epoch 30/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3812 - accuracy: 0.8145 - auc_9: 0.8497
Epoch 31/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3787 - accuracy: 0.8158 - auc_9: 0.8515
Epoch 32/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3772 - accuracy: 0.8167 - auc_9: 0.8527
Epoch 33/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3773 - accuracy: 0.8172 - auc_9: 0.8528
Epoch 34/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3770 - accuracy: 0.8182 - auc_9: 0.8532
Epoch 35/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3760 - accuracy: 0.8167 - auc_9: 0.8536
Epoch 36/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3754 - accuracy: 0.8180 - auc_9: 0.8548
Epoch 37/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3759 - accuracy: 0.8181 - auc_9: 0.8545
Epoch 38/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3751 - accuracy: 0.8182 - auc_9: 0.8548
Epoch 39/100
50766/50766 [==============================] - 4s 73us/sample - loss: 0.3760 - accuracy: 0.8191 - auc_9: 0.8539
Epoch 40/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3745 - accuracy: 0.8185 - auc_9: 0.8552
Epoch 41/100
50766/50766 [==============================] - 4s 72us/sample - loss: 0.3738 - accuracy: 0.8203 - auc_9: 0.8559
Epoch 42/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3742 - accuracy: 0.8208 - auc_9: 0.8556
Epoch 43/100
50766/50766 [==============================] - 4s 72us/sample - loss: 0.3732 - accuracy: 0.8184 - auc_9: 0.8565
Epoch 44/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3722 - accuracy: 0.8193 - auc_9: 0.8568
Epoch 45/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3727 - accuracy: 0.8201 - auc_9: 0.8574
Epoch 46/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3719 - accuracy: 0.8192 - auc_9: 0.8571
Epoch 47/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3718 - accuracy: 0.8193 - auc_9: 0.8576
Epoch 48/100
50766/50766 [==============================] - 4s 85us/sample - loss: 0.3716 - accuracy: 0.8200 - auc_9: 0.8578
Epoch 49/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3714 - accuracy: 0.8210 - auc_9: 0.8582
Epoch 50/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3708 - accuracy: 0.8217 - auc_9: 0.8583
Epoch 51/100
50766/50766 [==============================] - 4s 73us/sample - loss: 0.3695 - accuracy: 0.8203 - auc_9: 0.8594
Epoch 52/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3691 - accuracy: 0.8207 - auc_9: 0.8600
Epoch 53/100
50766/50766 [==============================] - 4s 74us/sample - loss: 0.3681 - accuracy: 0.8215 - auc_9: 0.8609
Epoch 54/100
50766/50766 [==============================] - 4s 72us/sample - loss: 0.3692 - accuracy: 0.8206 - auc_9: 0.8600
Epoch 55/100
50766/50766 [==============================] - 4s 72us/sample - loss: 0.3683 - accuracy: 0.8230 - auc_9: 0.8604
Epoch 56/100
50766/50766 [==============================] - 4s 72us/sample - loss: 0.3683 - accuracy: 0.8230 - auc_9: 0.8606
Epoch 57/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3669 - accuracy: 0.8217 - auc_9: 0.8618
Epoch 58/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3680 - accuracy: 0.8228 - auc_9: 0.8604
Epoch 59/100
50766/50766 [==============================] - 4s 78us/sample - loss: 0.3677 - accuracy: 0.8211 - auc_9: 0.8609
Epoch 60/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3665 - accuracy: 0.8216 - auc_9: 0.8622
Epoch 61/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3661 - accuracy: 0.8231 - auc_9: 0.8629
Epoch 62/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3666 - accuracy: 0.8236 - auc_9: 0.8625
Epoch 63/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3666 - accuracy: 0.8223 - auc_9: 0.8620
Epoch 64/100
50766/50766 [==============================] - 4s 71us/sample - loss: 0.3653 - accuracy: 0.8225 - auc_9: 0.8633
Epoch 65/100
50766/50766 [==============================] - 4s 72us/sample - loss: 0.3651 - accuracy: 0.8241 - auc_9: 0.8636
Epoch 66/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3645 - accuracy: 0.8218 - auc_9: 0.8638
Epoch 67/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3649 - accuracy: 0.8223 - auc_9: 0.8634
Epoch 68/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3647 - accuracy: 0.8234 - auc_9: 0.8637
Epoch 69/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3637 - accuracy: 0.8241 - auc_9: 0.8650
Epoch 70/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3644 - accuracy: 0.8238 - auc_9: 0.8641
Epoch 71/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3653 - accuracy: 0.8219 - auc_9: 0.8632
Epoch 72/100
50766/50766 [==============================] - 4s 70us/sample - loss: 0.3635 - accuracy: 0.8231 - auc_9: 0.8643
Out[242]:
<tensorflow.python.keras.callbacks.History at 0x1f78e57dfc8>

Neural Network Performance

Including additional layers with dropout yields a slight increase in test AUC over the perceptron model. However, this overall improvement comes at the cost of some minority-class performance. It's important to remember that each model has slight advantages and disadvantages depending on the use case; even with dropout regularization applied, the two networks perform virtually the same.

In [243]:
plot_ann_auc(ann,X_tests.values,y_tests.values,model_name='ANN W/ Dropout')
No Skill Rate AUC:  0.5
Learned AUC:  0.8320995728118947
In [244]:
ann_pred = ann.predict_classes(X_tests.values).flatten()
print(classification_report(y_tests,ann_pred))
              precision    recall  f1-score   support

         0.0       0.59      0.37      0.45      4889
         1.0       0.84      0.93      0.88     16868

    accuracy                           0.80     21757
   macro avg       0.71      0.65      0.67     21757
weighted avg       0.78      0.80      0.78     21757

In [245]:
print("Perceptron:\n",classification_report(y_tests,perc_pred),"ANN W/ Dropout Regularization:\n",classification_report(y_tests,ann_pred))
Perceptron:
               precision    recall  f1-score   support

         0.0       0.56      0.48      0.52      4889
         1.0       0.86      0.89      0.87     16868

    accuracy                           0.80     21757
   macro avg       0.71      0.68      0.69     21757
weighted avg       0.79      0.80      0.79     21757
 ANN W/ Dropout Regularization:
               precision    recall  f1-score   support

         0.0       0.59      0.37      0.45      4889
         1.0       0.84      0.93      0.88     16868

    accuracy                           0.80     21757
   macro avg       0.71      0.65      0.67     21757
weighted avg       0.78      0.80      0.78     21757

Conclusion and Domain Discussion

The Random Forest performed best overall on the dataset, though the neural networks also performed well. Below is a side-by-side comparison of each model's performance.

In [247]:
print("Random Forest:\n",classification_report(y_test,rf_pred),"\nPerceptron:\n",classification_report(y_tests,perc_pred),"\nANN W/ Dropout Regularization:\n",classification_report(y_tests,ann_pred))
Random Forest:
               precision    recall  f1-score   support

         0.0       0.70      0.45      0.55      4892
         1.0       0.86      0.94      0.90     16867

    accuracy                           0.83     21759
   macro avg       0.78      0.70      0.72     21759
weighted avg       0.82      0.83      0.82     21759
 
Perceptron:
               precision    recall  f1-score   support

         0.0       0.56      0.48      0.52      4889
         1.0       0.86      0.89      0.87     16868

    accuracy                           0.80     21757
   macro avg       0.71      0.68      0.69     21757
weighted avg       0.79      0.80      0.79     21757
 
ANN W/ Dropout Regularization:
               precision    recall  f1-score   support

         0.0       0.59      0.37      0.45      4889
         1.0       0.84      0.93      0.88     16868

    accuracy                           0.80     21757
   macro avg       0.71      0.65      0.67     21757
weighted avg       0.78      0.80      0.78     21757

In [177]:
importances = rf_clf.feature_importances_
indices = np.argsort(importances)
features = X_trains.columns

Benefit of Random Forest Interpretability

As mentioned earlier, a nice bonus of the Random Forest is that we can see which features yield the highest information gain and are thus most important. The graph below shows feature importance in descending order. Surprisingly, the quality of grid infrastructure and the prevalence of economic incentives appear to matter less for predicting solar adoption. The most important features suggest a mix of factors primarily revolving around regional education, income, and weather.
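The same ranking can also be read off directly without a plot. A small sketch, using hypothetical feature names and made-up importance values in place of the real rf_clf.feature_importances_ and X_trains.columns:

```python
import numpy as np

# Hypothetical stand-ins for rf_clf.feature_importances_ / X_trains.columns.
features = np.array(['median_income', 'education_bachelor',
                     'daily_solar_radiation', 'incentive_count'])
importances = np.array([0.30, 0.25, 0.35, 0.10])

# argsort returns ascending indices; reverse for descending importance.
order = np.argsort(importances)[::-1]
for name, score in zip(features[order], importances[order]):
    print(f'{name}: {score:.2f}')
```

With the real arrays substituted in, the loop prints the counties' most predictive features first, matching the order shown in the bar chart.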

In [183]:
plt.figure(figsize=(20,20))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='g', align='center')
plt.yticks(range(len(indices)), features[indices],fontsize=12)
plt.xlabel('Relative Importance')
Out[183]:
Text(0.5, 0, 'Relative Importance')
In [ ]: